GPU queue on Raijin
Raijin has a GPU queue and it's huge!

How-to page from NCI: http://nci.org.au/systems-services/national-facility/peak-system/raijin/specialised-queues/
Details: https://opus.nci.org.au/display/Help/GPU

Noteworthy Oddities

The GPU queue has two different kinds of nodes. The first 14 nodes they put in use 12-core Haswell E5-2670v3 CPUs @ 2.3 GHz. When they expanded, they added 16 nodes with 14-core Broadwell E5-2690v4 CPUs @ 2.6 GHz. Each node has two identical CPUs and 4 NVIDIA Tesla K80 cards, each of which has two GPUs. So a node has either 24 or 28 CPU cores alongside 8 GPUs.

It looks like when you request 4 GPUs/12 CPUs (i.e., half a node) and you're assigned a 2x14-core node, you get the extra two cores for free, so you actually get 4 GPUs/14 CPUs about half the time. Of course, what's optimal on 14 CPUs is not necessarily what's optimal on 12.

On top of that, GROMACS' hardware detection detects the entire node, not the subsection of it you've been assigned. This means that it starts way more threads than there are cores available to the job, unless you've requested entire nodes. I found the best way around this was to explicitly tell GROMACS how to parallelise each job.

In addition, the Raijin environment sets the environment variable OMP_NUM_THREADS to 1 by default. This is a long way from optimal for GROMACS, so I always unset it.

These optimisations are all implemented in the restartable resubmit script below.

Example resubmit scripts

Key elements:

* Load the correct GROMACS module - gromacs/<version>-gpu. GROMACS prior to 4.6 isn't compatible with GPUs, and not all of the versions since then are on Raijin. You can check what's available with: module avail gromacs
* Sometimes Raijin will give you nonconsecutive GPUs, or GPUs not including GPU0. If this is the case, GROMACS won't correctly detect all GPUs and your job will be slow.
  This line renumbers them so GROMACS can detect them:

export CUDA_VISIBLE_DEVICES=$(seq 0 $(( $PBS_NGPUS-1 )) | tr '\r\n' ',')

* I get the best speed per GPU with 2 GPUs (6-7 CPUs), but this may be system dependent. Combining this with a 1-hour resubmit script means you really fly through the queue. 4 GPUs are about 1.8 times the speed of 2. 8 GPUs are about 2.3 times the speed of 2.

Restartable resubmit script

This script is restartable, meaning that if it gets killed because NCI goes down for whatever reason, it'll start back up again from where it left off (losing at most 15 minutes of work). It also produces single output files for a whole production run, meaning you don't have to concatenate anything at the end. It just GROMPPs once, rather than for every submission. Finally, it delimits where the information goes better than the job counter script further down - your MDP says how long the simulation is, and your resubmit script says how that's divided up into chunks. With the job counter, the MDP, the script and the way the script is submitted are all needed to know how long a simulation actually is.

Existing simulations can be extended by changing the number of steps in the MDP file, deleting the TPR and then resubmitting.

#!/bin/bash
#PBS -P q95
#PBS -q gpu
#PBS -l walltime=1:00:00
#PBS -l mem=8GB
#PBS -l jobfs=100MB
#PBS -l ngpus=2
#PBS -l ncpus=6
#PBS -l other=mpi:hyperthread
#PBS -l wd
#PBS -r y

## General-purpose resubmit script for GROMACS jobs on Raijin

## Don't assign a title:
## We assume that the title variable refers to the name of
## the script for resubmission

## Jobs will be split up automatically by mdrun to fill the time.
## nsteps is set in the mdp

## Starting structure and .mdp should have the same name as the
## script, and all be in the same folder. Output will also have
## the same name.

## The script automatically chooses the multithreading
## mode based on the number of CPUs it finds.

# Define error function so we can see the error code given when something
# important crashes
errexit () {
    errstat=$?
    if [ $errstat != 0 ]; then
        # A brief nap so PBS kills us in normal termination
        # Prefer to be killed by PBS if PBS detected some resource
        # excess
        sleep 5
        echo "Job returned error status $errstat - stopping job sequence $PBS_JOBNAME at job $PBS_JOBID"
        exit $errstat
    fi
}

# Guarantee GPUs are visible
export CUDA_VISIBLE_DEVICES=$(seq 0 $(( $PBS_NGPUS-1 )) | tr '\r\n' ',')

# Change to working directory - not necessary with #PBS -l wd
cd $PBS_O_WORKDIR

# Terminate the job sequence if the file STOP_SEQUENCE is found in pwd
if [ -f STOP_SEQUENCE ]; then
    echo "STOP_SEQUENCE file found - terminating job sequence $PBS_JOBNAME at job $PBS_JOBID"
    exit 0
fi

#### GROMACS time! ####

# Load the gromacs module
module load gromacs/2016.1-gpu
# gromacs module includes openmpi

# Let GROMACS choose how to parallelise everything, unless we specify something later:
unset OMP_NUM_THREADS

## Define our mdrun command
# First, are we running on a single node? Dictates whether we use real MPI or thread-MPI
if (( $PBS_NGPUS <= 8 )); then
    ## Running on only 1 node (<= 8 GPU/24-28 CPU) - use thread MPI:
    mdrun_command="gmx mdrun"

    # GROMACS always detects the full node, even when we've only asked for a subset of it
    # So we need to tell GROMACS exactly what's up
    # Are we on a 14-core or 12-core node?
    if [ `nproc --all` -eq 56 ]; then
        # We're on a 14-core node
        export OMP_NUM_THREADS=7
        num_cores=`echo "$PBS_NCPUS + $PBS_NCPUS/6" | bc`
    elif [ `nproc --all` -eq 48 ]; then
        # We're on a 12-core node
        export OMP_NUM_THREADS=6
        num_cores=$PBS_NCPUS
    else
        echo "Node size couldn't be detected, exiting" >&2
        exit 2
    fi

    # Hyperthreading means we should have 2 threads per core
    num_threads=`echo "$num_cores * 2" | bc`
    # Number of ranks is easily calculated
    num_ranks=`echo "$num_threads / $OMP_NUM_THREADS" | bc `

    # Put it all together
    mdrun_command="$mdrun_command -nt $num_threads -ntmpi $num_ranks -ntomp $OMP_NUM_THREADS"
else
    ## Running on multiple nodes (> 8GPU/24-28 CPU) - use real MPI:
    # This hasn't been optimised by hand, but may not have the same problems as above since
    # each gmx_mpi instance runs on a full node
    mdrun_command='mpirun gmx_mpi mdrun'
fi

## GROMPP if there's no TPR file (eg, this is the first submission)
if [ ! -f ${PBS_JOBNAME}.tpr ]; then
    gmx grompp -f ${PBS_JOBNAME}.mdp -c ${PBS_JOBNAME}_start.gro -o ${PBS_JOBNAME}.tpr || errexit
fi

## Figure out how much time we have left
# Ensures we stay in time if job is restarted, or if we decide to add some
# expensive preamble to the script (like tune_pme or something)
# also means that changing the PBS job time sets mdrun's time limit automatically
qstat_out="`qstat -f $PBS_JOBID`"
PBS_WALLTIME=`echo "$qstat_out" | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p'`
IFS=: read h m s <<<"${PBS_WALLTIME%.*}"
seconds_total=$((10#$s+10#$m*60+10#$h*3600))
walltime_used=`echo "$qstat_out" | sed -rn 's/.*resources_used.walltime = (.*)/\1/p'`
IFS=: read h m s <<<"${walltime_used%.*}"
seconds_used=$((10#$s+10#$m*60+10#$h*3600))
hours_remaining=`echo "scale=4; ($seconds_total - $seconds_used) / 3600" | bc -l`

# Set mdrun's maximum hours so that it ends 0.05 hours (3 minutes) before walltime runs out,
# rather than 0.99 * as many hours
maxh=`echo "scale=4; ($hours_remaining - 0.05) / 0.99" | bc -l`

## run MD!
$mdrun_command -v -deffnm ${PBS_JOBNAME} -cpi ${PBS_JOBNAME}.cpt -maxh ${maxh} -nb gpu || errexit
# Notes:
# -cpi: Continue from checkpoint if available, otherwise start new simulation
# -maxh n: Write a checkpoint and terminate after 0.99 * n hours
# -nb gpu: Die if GPU not usable (unfortunately, won't die if GPU isn't found)

# Check the log file for the number of steps completed
# (keep only the last match, since the log is appended to across restarts)
steps_done=`perl -n -e'/Statistics over (\d+) steps using (\d+) frames/ && ($n = $1); END { print $n if defined $n }' ${PBS_JOBNAME}.log`
# Check the mdp file for the number of steps we want
steps_wanted=`perl -n -e'/nsteps *= *(\d+)/ && print $1' ${PBS_JOBNAME}.mdp`

# Resubmit if we need to
if (( steps_done < steps_wanted )); then
    echo "Job ${PBS_JOBID} terminated with ${steps_done}/${steps_wanted} steps finished."
    echo "Submitting next job in sequence $PBS_JOBNAME."
    qsub $PBS_JOBNAME
fi

Submission is easy - name the script <name>, your starting structure <name>_start.gro, and your MDP <name>.mdp, put them all in the same folder with your topology and submit the script from that folder. The script itself doesn't need to be changed, just renamed - it takes its name from the title of the job, which is in turn taken from the submit script's name. Just don't give the job a different name when you submit it (qsub's -N option). So it should be something like:

$ ssh raijin
Welcome to Raijin!
$ cd /path/to/simulation/directory
$ ls
clevername clevername.mdp clevername_start.gro topol.top topol_protein.itp posre_protein.itp
$ qsub clevername
1234567.r-man1

Using a job counter

This script is based on one I got from Nandhitha, converted to BASH and tweaked as I've needed it. You tell it NJOBS when you first submit it, and it runs that many jobs. I personally think the above restartable script is better in most ways and won't update this one any more, but it's here for legacy/completeness reasons.

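To make the file naming in the job counter script below a bit more concrete, here is a small sketch of how the job number maps to the files each job reads and writes. It copies the script's two-digit numbering with startnum=0 and the placeholder name TEMPLATE; the function name job_files is made up for this illustration and isn't part of the script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print which files job
# number NJOB reads and which prefix it writes, using the same two-digit
# numbering as the job counter script.
job_files () {
    local njob=$1
    local filename=${2:-TEMPLATE}   # same placeholder name as in the script
    local startnum=0
    local prev this
    prev=$(printf '%02d' $(( njob + startnum - 1 )))   # previous run number
    this=$(printf '%02d' $(( njob + startnum )))       # this run number
    echo "reads ${filename}.${prev}.gro ${filename}.${prev}.cpt, writes ${filename}.${this}.*"
}

job_files 1
job_files 10
```

Going by this, the first job in a sequence expects TEMPLATE.00.gro and TEMPLATE.00.cpt (plus TEMPLATE.mdp and topol.top) to already be in the working directory. Since the script carries the line #PBS -v NJOBS,NJOB, exporting NJOBS in your shell before calling qsub should be one way to set the number of jobs, though I haven't checked that against the current queue setup.
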
#!/bin/bash
#PBS -P q95
#PBS -l walltime=24:00:00
#PBS -l mem=8GB
#PBS -l jobfs=100MB
#PBS -q gpu
#PBS -l ngpus=4
#PBS -l ncpus=12
#PBS -l other=mpi
#PBS -l wd
#PBS -v NJOBS,NJOB

# Make GPUs visible if not assigned sequentially
export CUDA_VISIBLE_DEVICES=$(seq 0 $(( $PBS_NGPUS-1 )) | tr '\r\n' ',')

ECHO=/bin/echo

# These variables are assumed to be set:
# NJOBS is the total number of jobs in a sequence of jobs (defaults to 1)
# NJOB is the number of the current job in the sequence (defaults to 1)
#
if [ -z "${NJOBS+set}" ]; then
    $ECHO "NJOBS (total number of jobs in sequence) is not set - defaulting to 1"
    export NJOBS=1
fi

if [ -z "${NJOB+set}" ]; then
    $ECHO "NJOB (current job number in sequence) is not set - defaulting to 1"
    export NJOB=1
fi

#
# Quick termination of job sequence - look for a specific file
#
if [ -f STOP_SEQUENCE ]; then
    $ECHO "Terminating sequence at job number $NJOB"
    exit 0
fi

$ECHO "This is job $NJOB"

#
# Pre-job file manipulation goes here ...
#
# INSERT CODE
#

# Script for RAIJIN
#
# Haven't tried other versions of openmpi - might be better?
module load openmpi/1.6.3
module load gromacs/5.1.2-gpu

# Let GROMACS parallelise things as it wants
unset OMP_NUM_THREADS

# Prepare filenames
startnum=0
# Format numbers as two digits: 01, 02, ... , 09, 10, 11 etc.
formatstring="%02g"
printf -v i $formatstring $((NJOB + startnum - 1)) # Previous run number
printf -v j $formatstring $((NJOB + startnum))     # This run number
filename="TEMPLATE"

# Grompp and mdrun; && for error catching
gmx grompp -f ${filename}.mdp -c ${filename}.${i}.gro -t ${filename}.${i}.cpt -p topol.top -o ${filename}.${j} \
&& \
mpirun gmx_mpi mdrun -v -deffnm ${filename}.${j}
#wipe
#pwd
#
# Check the exit status
#
errstat=$?
if [ $errstat != 0 ]; then
    # A brief nap so PBS kills us in normal termination
    # Prefer to be killed by PBS if PBS detected some resource
    # excess
    sleep 5
    $ECHO "Job number $NJOB returned an error status $errstat - stopping job sequence."
    exit $errstat
fi

#
# Are we in an incomplete job sequence - more jobs to run ?
#
if (( $NJOB < $NJOBS )); then
    #
    # Post-job file manipulation (preparing for next job etc) goes here ...
    #
    # INSERT CODE HERE
    #

    #
    # Now increment counter and submit the next job
    #
    njob=$NJOB
    (( njob++ ))
    export NJOB=$njob
    $ECHO "Submitting job number $NJOB in sequence of $NJOBS jobs"
    # If we don't define a job name, the name of the script is used, so we can resubmit really easily:
    qsub $PBS_JOBNAME
else
    $ECHO "Finished last job in sequence of $NJOBS jobs"
fi

Category:Guides Category:Raijin Category:GROMACS Category:GPU